Algorithms for Minimum Risk Chunking

Author

Martin Jansche

Abstract

Stochastic finite automata are useful for identifying substrings (chunks) within larger units of text. Relevant applications include tokenization, base-NP chunking, named entity recognition, and other information extraction tasks. For a given input string, a stochastic automaton represents a probability distribution over strings of labels encoding the location of chunks. For chunking and extraction tasks, the quality of predictions is evaluated in terms of precision and recall of the chunked/extracted phrases when compared against some gold standard. However, traditional methods for estimating the parameters of a stochastic finite automaton and for decoding the best hypothesis do not pay attention to the evaluation criterion, which we take to be the well-known F-measure. We are interested in methods that remedy this situation, both in training and decoding. Our main result is a novel algorithm for efficiently evaluating expected F-measure. We present the algorithm and discuss its applications for utility/risk-based parameter estimation and decoding.
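To make the quantities in the abstract concrete, the following is a minimal Python sketch of chunk-level F-measure, its expectation under a distribution over label strings, and minimum-risk decoding. It is not the paper's efficient algorithm: it enumerates an explicitly listed distribution by brute force, which is only feasible for toy examples. The B/I/O label encoding, the function names, and the convention that F equals 1 when both chunk sets are empty are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

Chunk = Tuple[int, int]  # (start, end), end exclusive


def chunks_from_labels(labels: str) -> Set[Chunk]:
    """Read chunks off a label string over {'B', 'I', 'O'} (assumed encoding).

    'B' begins a chunk, 'I' continues one, 'O' is outside any chunk.
    An 'I' with no open chunk is treated as beginning a chunk.
    """
    chunks: Set[Chunk] = set()
    start = None
    for i, lab in enumerate(labels):
        if lab == "B" or (lab == "I" and start is None):
            if start is not None:
                chunks.add((start, i))  # close the chunk that was open
            start = i
        elif lab == "O":
            if start is not None:
                chunks.add((start, i))
            start = None
    if start is not None:
        chunks.add((start, len(labels)))
    return chunks


def f_measure(hyp: Set[Chunk], gold: Set[Chunk], beta: float = 1.0) -> float:
    """F_beta from exact chunk matches: (1+b^2)*tp / (b^2*|gold| + |hyp|)."""
    if not hyp and not gold:
        return 1.0  # assumed convention for the 0/0 case
    tp = len(hyp & gold)
    b2 = beta * beta
    return (1.0 + b2) * tp / (b2 * len(gold) + len(hyp))


def expected_f_measure(dist: Dict[str, float], gold_labels: str,
                       beta: float = 1.0) -> float:
    """Expected F of a random hypothesis drawn from dist against a fixed gold."""
    gold = chunks_from_labels(gold_labels)
    return sum(p * f_measure(chunks_from_labels(y), gold, beta)
               for y, p in dist.items())


def min_risk_decode(dist: Dict[str, float], beta: float = 1.0) -> str:
    """Pick the candidate with the highest expected F when the true labeling
    is distributed according to dist (candidates = support of dist)."""
    def utility(h: str) -> float:
        hyp = chunks_from_labels(h)
        return sum(p * f_measure(hyp, chunks_from_labels(y), beta)
                   for y, p in dist.items())
    return max(dist, key=utility)


if __name__ == "__main__":
    # Toy distribution over label strings for a three-token input.
    toy = {"OOO": 0.4, "BOO": 0.35, "BOB": 0.25}
    print("most probable labeling:", max(toy, key=toy.get))
    print("minimum-risk labeling: ", min_risk_decode(toy))
    print("expected F vs. gold BOO:", expected_f_measure(toy, "BOO"))
```

In the toy distribution the most probable label string and the minimum-risk choice differ, which illustrates why decoding with the evaluation criterion in mind can matter. A stochastic finite automaton defines such a distribution implicitly over exponentially many label strings, so explicit enumeration like this does not scale.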

Similar resources

Segmenting vs. Chunking Rules: Unsupervised ITG Induction via Minimum Conditional Description Length

We present an unsupervised learning model that induces phrasal inversion transduction grammars by introducing a minimum conditional description length (CDL) principle to drive search over a space defined by two opposing extreme types of ITGs. Our approach attacks the difficulty of acquiring more complex longer rules when inducing inversion transduction grammars via unsupervised bottom-up chunki...

Iterative Rule Segmentation under Minimum Description Length for Unsupervised Transduction Grammar Induction

We argue that for purely incremental unsupervised learning of phrasal inversion transduction grammars, a minimum description length driven, iterative top-down rule segmentation approach that is the polar opposite of Saers, Addanki, and Wu’s previous 2012 bottom-up iterative rule chunking model yields significantly better translation accuracy and grammar parsimony. We still aim for unsupervised ...

Unsupervised Transduction Grammar Induction via Minimum Description Length

We present a minimalist, unsupervised learning model that induces relatively clean phrasal inversion transduction grammars by employing the minimum description length principle to drive search over a space defined by two opposing extreme types of ITGs. In comparison to most current SMT approaches, the model learns very parsimonious phrase translation lexicons that provide an obvious basis for...

Content-dependent chunking for differential compression, the local maximum approach

When a file is to be transmitted from a sender to a recipient and when the latter already has a file somewhat similar to it, remote differential compression seeks to determine the similarities interactively so as to transmit only the part of the new file not already in the recipient’s old file. Content-dependent chunking means that the sender and recipient chop their files into chunks, with the...

Combining Top-down and Bottom-up Search for Unsupervised Induction of Transduction Grammars

We show that combining both bottom-up rule chunking and top-down rule segmentation search strategies in purely unsupervised learning of phrasal inversion transduction grammars yields significantly better translation accuracy than either strategy alone. Previous approaches have relied on incrementally building larger rules by chunking smaller rules bottom-up; we introduce a complementary top-down...

Publication date: 2005
